Abstract

Log-linear models provide a statistically sound 
framework for Stochastic "Unification-Based" 
Grammars (SUBGs) and stochastic versions of 
other kinds of grammars. We describe two 
computationally-tractable ways of estimating 
the parameters of such grammars from a train- 
ing corpus of syntactic analyses, and apply 
these to estimate a stochastic version of Lexical- 
Functional Grammar. 

1 Introduction 

Probabilistic methods have revolutionized com- 
putational linguistics. They can provide a 
systematic treatment of preferences in pars- 
ing. Given a suitable estimation procedure, 
stochastic models can be "tuned" to reflect the 
properties of a corpus. On the other hand, 
"Unification-Based" Grammars (UBGs) can ex- 
press a variety of linguistically-important syn- 
tactic and semantic constraints. However, de- 
veloping Stochastic "Unification-based" Gram- 
mars (SUBGs) has not proved as straight- 
forward as might be hoped. 

The simple "relative frequency" estimator
for PCFGs yields the maximum likelihood parameter
estimate, which is to say that it minimizes the
Kullback-Leibler divergence between the training
and estimated distributions. On the other hand,
as Abney (1997) points out, the context-sensitive
dependencies that "unification-based" constraints
introduce render the relative frequency estimator
suboptimal: in general it does not maximize the
likelihood and it is inconsistent.

Abney (1997) proposes a Markov Random
Field or log-linear model for SUBGs, and the
models described here are instances of Abney's
general framework. However, the Monte Carlo
parameter estimation procedure that Abney
proposes seems to be computationally impractical
for reasonable-sized grammars. Sections 3
and 4 describe two new estimation procedures
which are computationally tractable. Section 5
describes an experiment with a small LFG corpus
provided to us by Xerox PARC. The log-linear
framework and the estimation procedures
are extremely general, and they apply directly
to stochastic versions of HPSG and other theories
of grammar.

2 Features in SUBGs 

We follow the statistical literature in using the
term feature to refer to the properties that parameters
are associated with (we use the word
"attribute" to refer to the attributes or features
of a UBG's feature structure). Let $\Omega$ be the
set of all possible grammatical or well-formed
analyses. Each feature $f$ maps a syntactic analysis
$\omega \in \Omega$ to a real value $f(\omega)$. The form of
a syntactic analysis depends on the underlying
linguistic theory. For example, for a PCFG $\omega$
would be a parse tree, for an LFG $\omega$ would be a
tuple consisting of (at least) a c-structure, an
f-structure and a mapping from c-structure nodes
to f-structure elements, and for a Chomskyan
transformational grammar $\omega$ would be a derivation.

Log-linear models are models in which the
log probability is a linear combination of feature
values (plus a constant). PCFGs, Gibbs
distributions, Maximum-Entropy distributions
and Markov Random Fields are all examples of
log-linear models. A log-linear model associates
each feature $f_j$ with a real-valued parameter $\theta_j$.



A log-linear model with $m$ features is one in
which the likelihood $P_\theta(\omega)$ of an analysis $\omega$ is:

$$P_\theta(\omega) = \frac{1}{Z_\theta} \exp\Big(\sum_{j=1,\ldots,m} \theta_j f_j(\omega)\Big), \qquad Z_\theta = \sum_{\omega' \in \Omega} \exp\Big(\sum_{j=1,\ldots,m} \theta_j f_j(\omega')\Big) \qquad (1)$$

While the estimators described below make
no assumptions about the range of the $f_i$, in
the models considered here the value of each
feature $f_i(\omega)$ is the number of times a particular
structural arrangement or configuration occurs
in the analysis $\omega$, so $f_i(\omega)$ ranges over the
natural numbers.
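
As a rough illustration of the definition, the following Python sketch evaluates a log-linear distribution over a small, explicitly enumerated set of analyses. The three feature vectors and the parameter values are invented for the example, and a realistic set of analyses would be far too large to enumerate like this.

```python
import math

def log_linear_probs(theta, feature_vectors):
    """Return P_theta(w) for each analysis w in a finite, enumerable set.

    Each analysis is represented only by its feature vector (f_1(w), ..., f_m(w)).
    """
    scores = [sum(t * f for t, f in zip(theta, fv)) for fv in feature_vectors]
    shift = max(scores)                       # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    z = sum(weights)                          # partition function (of the shifted scores)
    return [w / z for w in weights]

# Three hypothetical analyses, each described by counts of two configurations.
features = [(1, 0), (0, 1), (1, 1)]
probs = log_linear_probs([0.5, -1.0], features)
uniform = log_linear_probs([0.0, 0.0], features)
```

Setting every parameter to zero makes all analyses equally likely, which is the sense in which the parameters tilt an otherwise uniform distribution.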

For example, the features of a PCFG are
indexed by productions, i.e., the value $f_i(\omega)$
of feature $f_i$ is the number of times the
$i$th production is used in the derivation $\omega$.
This set of features induces a tree-structured
dependency graph on the productions which
is characteristic of Markov Branching Processes
(Pearl, 1988; Frey, 1998). This tree
structure has the important consequence that
simple "relative-frequencies" yield maximum-likelihood
estimates for the $\theta_i$.
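
The relative-frequency estimator for a PCFG can be sketched in a few lines of Python. The tree encoding (nested tuples with string leaves) and the two-tree treebank are assumptions made for the example. Under the standard correspondence each parameter is the logarithm of the estimated production probability.

```python
from collections import Counter, defaultdict

def productions(tree):
    """Yield (lhs, rhs) for every local tree; a tree is (label, child, ...), leaves are strings."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for c in children:
        if not isinstance(c, str):
            yield from productions(c)

def relative_frequency(trees):
    """Estimate P(lhs -> rhs) as count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for t in trees for p in productions(t))
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {p: c / lhs_totals[p[0]] for p, c in counts.items()}

# A hypothetical two-tree treebank.
treebank = [
    ("S", ("NP", "Bill"), ("VP", ("V", "left"))),
    ("S", ("NP", "Hillary"), ("VP", ("V", "left"), ("ADV", "yesterday"))),
]
rule_probs = relative_frequency(treebank)
```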

Extending a PCFG model by adding addi- 
tional features not associated with productions 
will in general add additional dependencies, de- 
stroy the tree structure, and substantially com- 
plicate maximum likelihood estimation. 

This is the situation for a SUBG, even if the
features are production occurrences. The unification
constraints create non-local dependencies
among the productions and the dependency
graph of a SUBG is usually not a tree. Consequently,
maximum likelihood estimation is no
longer a simple matter of computing relative
frequencies. But the resulting estimation procedures
(discussed in detail shortly), albeit more
complicated, have the virtue of applying to essentially
arbitrary features, whether of the production
or non-production type. That is, since estimators
capable of finding maximum-likelihood parameter
estimates for production features in a
SUBG will also find maximum-likelihood estimates
for non-production features, there is no
motivation for restricting features to be of the
production type.

Linguistically there is no particular reason 
for assuming that productions are the best fea- 
tures to use in a stochastic language model. 



For example, the adjunct attachment ambiguity
in (1) results in alternative syntactic structures
which use the same productions the same
number of times in each derivation, so a model
with only production features would necessarily
assign them the same likelihood. Thus models
that use production features alone predict that
there should not be a systematic preference for
one of these analyses over the other, contrary to
standard psycholinguistic results.

(1) a. Bill thought Hillary [VP [VP left ] yesterday ]
    b. Bill [VP [VP thought Hillary left ] yesterday ]
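
The claim that both attachments use the same productions the same number of times can be checked mechanically. In the Python sketch below the full trees, including the category labels S, NP, V and ADV, are assumptions filled in for illustration, since the example only marks the VP brackets.

```python
from collections import Counter

def production_counts(tree):
    """Count productions in a tree encoded as (label, child, ...), with string leaves."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts = Counter([(label, rhs)])
    for c in children:
        if not isinstance(c, str):
            counts += production_counts(c)
    return counts

# Low attachment: "yesterday" modifies "left".
low = ("S", ("NP", "Bill"),
       ("VP", ("V", "thought"),
        ("S", ("NP", "Hillary"),
         ("VP", ("VP", ("V", "left")), ("ADV", "yesterday")))))

# High attachment: "yesterday" modifies "thought Hillary left".
high = ("S", ("NP", "Bill"),
        ("VP", ("VP", ("V", "thought"),
                ("S", ("NP", "Hillary"), ("VP", ("V", "left")))),
         ("ADV", "yesterday")))

same_features = production_counts(low) == production_counts(high)
```

Because the two feature vectors coincide, any parameter setting gives the two trees the same score, so a production-only model cannot prefer either one.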

There are many different ways of choosing 
features for a SUBG, and each of these choices 
makes an empirical claim about possible distri- 
butions of sentences. Specifying the features of 
a SUBG is as much an empirical matter as spec- 
ifying the grammar itself. For any given UBG 
there are a large (usually infinite) number of 
SUBGs that can be constructed from it, differ- 
ing only in the features that each SUBG uses. 

In addition to production features, the 
stochastic LFG models evaluated below used 
the following kinds of features, guided by the 
principles proposed by Hobbs and Bear (1995).
Adjunct and argument features indicate adjunct 
and argument attachment respectively, and per- 
mit the model to capture a general argument 
attachment preference. In addition, there are 
specialized adjunct and argument features cor- 
responding to each grammatical function used 
in LFG (e.g., SUBJ, OBJ, COMP, XCOMP, 
ADJUNCT, etc.). There are features indi- 
cating both high and low attachment (deter- 
mined by the complexity of the phrase being 
attached to). Another feature indicates non- 
right-branching nonterminal nodes. There is 
a feature for non-parallel coordinate structures 
(where parallelism is measured in constituent 
structure terms). Each f-structure attribute- 
atomic value pair which appears in any feature 
structure is also used as a feature. We also use 
a number of features identifying syntactic struc- 
tures that seem particularly important in these 
corpora, such as a feature identifying NPs that 
are dates (it seems that date interpretations of 
NPs are preferred). We would have liked to 
have included features concerning specific lexical
items (to capture head-to-head dependencies),
but we felt that our corpora were so small
that the associated parameters could not be
accurately estimated.

3 A pseudo-likelihood estimator for 
log linear models 

Suppose $\tilde\omega = \omega_1, \ldots, \omega_n$ is a training corpus
of $n$ syntactic analyses. Letting
$f_j(\tilde\omega) = \sum_{i=1,\ldots,n} f_j(\omega_i)$, the log likelihood of the corpus
$\tilde\omega$ and its derivatives are:

$$\log L_\theta(\tilde\omega) = \sum_{j=1,\ldots,m} \theta_j f_j(\tilde\omega) - n \log Z_\theta \qquad (2)$$

$$\frac{\partial \log L_\theta(\tilde\omega)}{\partial \theta_j} = f_j(\tilde\omega) - n\, E_\theta(f_j) \qquad (3)$$

where $E_\theta(f_j)$ is the expected value of $f_j$ under
the distribution determined by the parameters
$\theta$. The maximum-likelihood estimates are the $\theta$
which maximize $\log L_\theta(\tilde\omega)$. The chief difficulty
in finding the maximum-likelihood estimates is
calculating $E_\theta(f_j)$, which involves summing over
the space of well-formed syntactic structures $\Omega$.
There seems to be no analytic or efficient numerical
way of doing this for a realistic SUBG.
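
The appearance of the expectation in the derivative follows from differentiating the log of the normalizing constant. The short derivation below assumes the usual exponential form of the model, with $Z_\theta$ summing the unnormalized weights over all well-formed analyses $\Omega$.

```latex
\begin{align*}
\frac{\partial \log Z_\theta}{\partial \theta_j}
  &= \frac{1}{Z_\theta} \sum_{\omega \in \Omega} f_j(\omega)
     \exp\Big(\sum_{k=1,\ldots,m} \theta_k f_k(\omega)\Big) \\
  &= \sum_{\omega \in \Omega} f_j(\omega)\, P_\theta(\omega)
   \;=\; E_\theta(f_j), \\
\frac{\partial}{\partial \theta_j}
  \Big( \sum_{k=1,\ldots,m} \theta_k f_k(\tilde\omega) - n \log Z_\theta \Big)
  &= f_j(\tilde\omega) - n\, E_\theta(f_j).
\end{align*}
```

At a maximum the derivative vanishes, so the expected feature values under the model match their average values in the training corpus.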



Abney (1997) proposes a gradient ascent
based upon a Monte Carlo procedure for estimating
$E_\theta(f_j)$. The idea is to generate random
samples of feature structures from the distribution
$P_{\hat\theta}(\omega)$, where $\hat\theta$ is the current parameter
estimate, and to use these to estimate $E_{\hat\theta}(f_j)$,
and hence the gradient of the likelihood. Samples
are generated as follows: Given a SUBG,
Abney constructs a covering PCFG based upon
the SUBG and $\hat\theta$, the current estimate of $\theta$. The
derivation trees of the PCFG can be mapped
onto a set containing all of the SUBG's syntactic
analyses. Monte Carlo samples from the
PCFG are comparatively easy to generate, and
sample syntactic analyses that do not map to
well-formed SUBG syntactic structures are then
simply discarded. This generates a stream of
syntactic structures, but not distributed according
to $P_{\hat\theta}(\omega)$ (distributed instead according to
the restriction of the PCFG to the SUBG). Abney
proposes using a Metropolis acceptance-rejection
method to adjust the distribution of
this stream of feature structures to achieve detailed
balance, which then produces a stream
of feature structures distributed according to
$P_{\hat\theta}(\omega)$.
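
The acceptance-rejection step can be illustrated with a generic independence Metropolis-Hastings sampler. This Python sketch is not Abney's procedure: the three "analyses", the proposal probabilities (standing in for the PCFG restricted to well-formed analyses) and the target weights (standing in for the unnormalized log-linear weights) are all invented for the example.

```python
import random

def independence_metropolis(target_weight, proposal_prob, steps, seed=0):
    """Draw a chain whose stationary distribution is proportional to target_weight.

    Candidates are drawn independently from proposal_prob at every step.
    """
    rng = random.Random(seed)
    items = list(proposal_prob)
    weights = [proposal_prob[x] for x in items]

    def draw():
        return rng.choices(items, weights=weights, k=1)[0]

    current = draw()
    chain = []
    for _ in range(steps):
        cand = draw()
        # Ratio of importance weights target/proposal for candidate and current state.
        ratio = (target_weight[cand] * proposal_prob[current]) / (
            target_weight[current] * proposal_prob[cand])
        if rng.random() < ratio:          # ratios >= 1 are always accepted
            current = cand
        chain.append(current)
    return chain

proposal = {"a": 0.5, "b": 0.3, "c": 0.2}   # hypothetical proposal distribution
target = {"a": 1.0, "b": 2.0, "c": 1.0}     # hypothetical unnormalized target weights
chain = independence_metropolis(target, proposal, steps=20000)
freq_b = chain.count("b") / len(chain)
```

Even in this tiny setting many samples are needed before the empirical frequencies settle near the target, which gives some feel for the cost of repeating such sampling at every gradient step.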

While this scheme is theoretically sound, it
would appear to be computationally impractical
for realistic SUBGs. Every step of the proposed
procedure (corresponding to a single step
of gradient ascent) requires a very large number
of PCFG samples: samples must be found that
correspond to well-formed SUBG analyses; many such
samples are required to bring the Metropolis algorithm
to (near) equilibrium; and many samples
are needed at equilibrium to properly estimate
$E_\theta(f_j)$.

The idea of a gradient ascent of the likelihood
(2) is appealing: a simple calculation reveals
that the log likelihood is concave and therefore free
of local maxima. But the gradient (in particular,
$E_\theta(f_j)$) is intractable. This motivates an
alternative strategy involving a data-based estimate
of $E_\theta(f_j)$:

$$E_\theta(f_j) = E_\theta\big(E_\theta(f_j(\omega) \mid y(\omega))\big) \qquad (4)$$

$$\approx \frac{1}{n} \sum_{i=1,\ldots,n} E_\theta(f_j(\omega) \mid y(\omega) = y_i) \qquad (5)$$

where $y(\omega)$ is the yield belonging to the syntactic
analysis $\omega$, and $y_i = y(\omega_i)$ is the yield
belonging to the $i$th sample in the training corpus.

The point is that $E_\theta(f_j(\omega) \mid y(\omega) = y_i)$ is generally
computable. In fact, if $\Omega(y)$ is the set of
well-formed syntactic structures that have yield
$y$ (i.e., the set of possible parses of the string $y$),
then

$$E_\theta(f_j(\omega) \mid y(\omega) = y_i) = \frac{\sum_{\omega' \in \Omega(y_i)} f_j(\omega') \exp\big(\sum_{k=1,\ldots,m} \theta_k f_k(\omega')\big)}{\sum_{\omega' \in \Omega(y_i)} \exp\big(\sum_{k=1,\ldots,m} \theta_k f_k(\omega')\big)}$$

Hence the calculation of the conditional expectations
only involves summing over the possible
syntactic analyses or parses $\Omega(y_i)$ of the strings
in the training corpus. While it is possible to
construct UBGs for which the number of possible
parses is unmanageably high, for many
grammars it is quite manageable to enumerate
the set of possible parses and thereby directly
evaluate $E_\theta(f_j(\omega) \mid y(\omega) = y_i)$.
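
Because the sum runs only over the parses of one string, the conditional expectation is a few lines of code once those parses have been enumerated. In this Python sketch each parse is represented by its feature vector; the two-parse example and the parameter values are made up for illustration.

```python
import math

def conditional_expectation(theta, parses, j):
    """E_theta(f_j | yield), where `parses` lists the feature vectors of all parses of one yield."""
    scores = [sum(t * f for t, f in zip(theta, fv)) for fv in parses]
    shift = max(scores)                         # for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    z = sum(weights)                            # normalizes over this yield's parses only
    return sum(w * fv[j] for w, fv in zip(weights, parses)) / z

# A hypothetical sentence with two parses and two features.
parses = [(1, 0), (0, 2)]
e_uniform = conditional_expectation([0.0, 0.0], parses, 0)
e_tilted = conditional_expectation([math.log(3.0), 0.0], parses, 0)
```

With all parameters at zero the two parses are equally likely, so the expectation of the first feature is one half; raising its parameter to log 3 makes the first parse three times as likely as the second.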

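Therefore, we propose replacing the gradient,
(3), by

$$f_j(\tilde\omega) - \sum_{i=1,\ldots,n} E_\theta(f_j(\omega) \mid y(\omega) = y_i) \qquad (6)$$

and performing a gradient ascent. Of course (6)
is no longer the gradient of the likelihood function,
but fortunately it is (exactly) the gradient
of (the log of) another criterion:

$$PL_\theta(\tilde\omega) = \prod_{i=1,\ldots,n} P_\theta(\omega = \omega_i \mid y(\omega) = y_i) \qquad (7)$$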

Instead of maximizing the likelihood of the syntactic
analyses over the training corpus, we
maximize the conditional likelihood of these
analyses given the observed yields. In our experiments,
we have used a conjugate-gradient optimization
program adapted from the one presented
in Press et al. (1992).
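
The following Python sketch computes the log pseudo-likelihood and its gradient for a toy corpus and maximizes it. It uses plain fixed-step gradient ascent rather than the conjugate-gradient optimizer mentioned above, and the corpus, step size and iteration count are assumptions chosen so that the example converges quickly.

```python
import math

def cond_probs(theta, parses):
    """Conditional probability of each parse given the yield they share."""
    scores = [sum(t * f for t, f in zip(theta, fv)) for fv in parses]
    shift = max(scores)
    weights = [math.exp(s - shift) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def log_pl_and_grad(theta, corpus):
    """corpus: list of (parses, index of correct parse). Returns (log PL, gradient)."""
    value, grad = 0.0, [0.0] * len(theta)
    for parses, gold in corpus:
        p = cond_probs(theta, parses)
        value += math.log(p[gold])
        for j in range(len(theta)):
            expected = sum(pi * fv[j] for pi, fv in zip(p, parses))
            grad[j] += parses[gold][j] - expected   # observed minus expected count
    return value, grad

def gradient_ascent(corpus, m, rate=0.1, steps=200):
    theta = [0.0] * m
    for _ in range(steps):
        _, g = log_pl_and_grad(theta, corpus)
        theta = [t + rate * gj for t, gj in zip(theta, g)]
    return theta

# Hypothetical corpus: the same two parses each time; parse 0 is correct twice, parse 1 once.
two = [(1, 0), (0, 1)]
corpus = [(two, 0), (two, 1), (two, 0)]
theta_hat = gradient_ascent(corpus, m=2)
```

On this corpus the fitted model should give the first parse conditional probability 2/3, which corresponds to a parameter difference of log 2.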

Regardless of the pragmatic (computational)
motivation, one could perhaps argue that the
conditional probabilities $P_\theta(\omega \mid y)$ are as useful
as (if not more useful than) the full probabilities
$P_\theta(\omega)$, at least in those cases for which
the ultimate goal is syntactic analysis. Berger
et al. (1996) and Jelinek (1997) make this same
point and arrive at the same estimator, albeit
through a maximum entropy argument.

The problem of estimating parameters for
log-linear models is not new. It is especially difficult
in cases, such as ours, where a large sample
space makes the direct computation of expectations
infeasible. Many applications in spatial
statistics, involving Markov random fields
(MRF), are of this nature as well. In his
seminal development of the MRF approach to
spatial statistics, Besag introduced a "pseudo-likelihood"
estimator to address these difficulties
(Besag, 1974; Besag, 1975), and in fact our
proposal here is an instance of his method. In
general, the likelihood function is replaced by a
more manageable product of conditional likelihoods
(a pseudo-likelihood, hence the designation
$PL_\theta$), which is then optimized over the parameter
vector, instead of the likelihood itself.
In many cases, as in our case here, this substitution
sidesteps much of the computational
burden without sacrificing consistency (more on
this shortly).

What are the asymptotics of optimizing a
pseudo-likelihood function? Look first at the
likelihood itself. For large $n$:

$$\frac{1}{n} \log L_\theta(\tilde\omega) = \frac{1}{n} \log \prod_{i=1,\ldots,n} P_\theta(\omega_i) = \frac{1}{n} \sum_{i=1,\ldots,n} \log P_\theta(\omega_i) \approx \int P_{\theta_0}(\omega) \log P_\theta(\omega)\, d\omega \qquad (8)$$

where $\theta_0$ is the true (and unknown) parameter
vector. Up to a constant, (8) is the negative
of the Kullback-Leibler divergence between
the true and estimated distributions of syntactic
analyses. As sample size grows, maximizing
likelihood amounts to minimizing divergence.
As for pseudo-likelihood:

$$\frac{1}{n} \log PL_\theta(\tilde\omega) = \frac{1}{n} \log \prod_{i=1,\ldots,n} P_\theta(\omega = \omega_i \mid y(\omega) = y_i) = \frac{1}{n} \sum_{i=1,\ldots,n} \log P_\theta(\omega = \omega_i \mid y(\omega) = y_i) \approx E_{\theta_0}\Big[\int P_{\theta_0}(\omega \mid y) \log P_\theta(\omega \mid y)\, d\omega\Big]$$

So that maximizing pseudo-likelihood (at large
samples) amounts to minimizing the average
(over yields) divergence between the true and
estimated conditional distributions of analyses
given yields.

Maximum likelihood estimation is consistent:
under broad conditions the sequence of distributions
$P_{\hat\theta_n}$, associated with the maximum
likelihood estimator for $\theta_0$ given the samples
$\omega_1, \ldots, \omega_n$, converges to $P_{\theta_0}$. Pseudo-likelihood
is also consistent, but in the present implementation
it is consistent for the conditional distributions
$P_{\theta_0}(\omega \mid y(\omega))$ and not necessarily for
the full distribution $P_{\theta_0}$ (see Chi (1998)). It is
not hard to see that pseudo-likelihood will not
always correctly estimate $P_{\theta_0}$. Suppose there
is a feature $f_i$ which depends only on yields:
$f_i(\omega) = f_i(y(\omega))$. (Later we will refer to such
features as pseudo-constant.) In this case, the
derivative of $PL_\theta(\tilde\omega)$ with respect to $\theta_i$ is zero;
$PL_\theta(\tilde\omega)$ contains no information about $\theta_i$. In
fact, in this case any value of $\theta_i$ gives the same
conditional distribution $P_\theta(\omega \mid y(\omega))$; $\theta_i$ is irrelevant
to the problem of choosing good parses.
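
The irrelevance of a pseudo-constant feature is easy to confirm numerically. In this Python sketch the second feature takes the same value on every parse of a (hypothetical) sentence, so its contribution cancels between numerator and denominator of the conditional probability.

```python
import math

def cond_probs(theta, parses):
    """Conditional probability of each parse given the yield they share."""
    scores = [sum(t * f for t, f in zip(theta, fv)) for fv in parses]
    shift = max(scores)
    weights = [math.exp(s - shift) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

# The second feature equals 3 on every parse of this yield (pseudo-constant).
parses = [(1, 3), (0, 3), (2, 3)]
p_negative = cond_probs([0.4, -5.0], parses)
p_positive = cond_probs([0.4, 5.0], parses)
max_gap = max(abs(a - b) for a, b in zip(p_negative, p_positive))
```

Changing the second parameter from -5 to 5 leaves the conditional distribution unchanged up to rounding error.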

Despite the assurance of consistency, pseudo-likelihood
estimation is prone to overfitting
when a large number of features is matched
against a modest-sized training corpus. One
particularly troublesome manifestation of overfitting
results from the existence of features
which, relative to the training set, we might
term "pseudo-maximal": Let us say that a
feature $f$ is pseudo-maximal for a yield $y$ iff
$\forall \omega' \in \Omega(y)$, $f(\omega) \geq f(\omega')$, where $\omega$ is any correct
parse of $y$, i.e., the feature's value on every
correct parse $\omega$ of $y$ is greater than or equal
to its value on any other parse of $y$. Pseudo-minimal
features are defined similarly. It is easy
to see that if $f_j$ is pseudo-maximal on each sentence
of the training corpus then the parameter
assignment $\theta_j = \infty$ maximizes the corpus
pseudo-likelihood. (Similarly, the assignment
$\theta_j = -\infty$ maximizes pseudo-likelihood if
$f_j$ is pseudo-minimal over the training corpus.)
Such infinite parameter values indicate that the
model treats pseudo-maximal features categorically;
i.e., any parse with a non-maximal feature
value is assigned a zero conditional probability.
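
A small numerical experiment shows why the best parameter value runs off to infinity. The Python sketch below uses a single-feature model and a made-up two-sentence corpus in which the correct parse always has the largest feature value; the log pseudo-likelihood keeps increasing as the parameter grows.

```python
import math

def log_pl(theta_j, corpus):
    """Log pseudo-likelihood of a one-feature model.

    corpus: list of (feature value of each parse, index of the correct parse).
    """
    total = 0.0
    for values, gold in corpus:
        log_z = math.log(sum(math.exp(theta_j * v) for v in values))
        total += theta_j * values[gold] - log_z
    return total

# Hypothetical corpus: the feature is pseudo-maximal for both sentences.
# In the second sentence an incorrect parse ties with the correct one.
corpus = [([2, 1, 0], 0), ([1, 1, 0], 1)]
curve = [log_pl(t, corpus) for t in (0.0, 1.0, 5.0, 20.0)]
```

The values rise towards log(1/2), the limit set by the tie in the second sentence, but no finite parameter value attains it.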

Of course, a feature which is pseudo-maximal
over the training corpus is not necessarily
pseudo-maximal for all yields. This is an instance
of overfitting, and it can be addressed,
as is customary, by adding a regularization term
that promotes small values of $\theta$ to the objective
function. A common choice is to add a
quadratic to the log-likelihood, which corresponds
to multiplying the likelihood itself by
a normal distribution. In our experiments, we
multiplied the pseudo-likelihood by a zero-mean
normal in $\theta_1, \ldots, \theta_m$, with diagonal covariance,
and with standard deviation $\sigma_j$ for $\theta_j$ equal to
7 times the maximum value of $f_j$ found in any
parse in the training corpus. (We experimented
with other values for $\sigma_j$, but the choice seems to
have little effect.) Thus instead of maximizing
the log pseudo-likelihood, we choose $\theta$ to maximize

$$\log PL_\theta(\tilde\omega) - \sum_{j=1,\ldots,m} \frac{\theta_j^2}{2\sigma_j^2}$$
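
The penalty is straightforward to add to any routine that already returns the log pseudo-likelihood. The Python sketch below is one possible arrangement; the function names are invented, and it assumes every feature takes a nonzero value somewhere in the training corpus, since a feature that never fires would give a zero standard deviation.

```python
def sigmas_from_corpus(all_parse_feature_vectors, scale=7.0):
    """Standard deviation per feature: `scale` times its largest value in any training parse."""
    m = len(all_parse_feature_vectors[0])
    return [scale * max(fv[j] for fv in all_parse_feature_vectors) for j in range(m)]

def penalized_log_pl(log_pl_value, theta, sigmas):
    """Log pseudo-likelihood plus the log of a zero-mean diagonal normal prior (up to a constant)."""
    return log_pl_value - sum(t * t / (2.0 * s * s) for t, s in zip(theta, sigmas))

def penalty_gradient(theta, sigmas):
    """Term to add to the gradient of the log pseudo-likelihood."""
    return [-t / (s * s) for t, s in zip(theta, sigmas)]

# Hypothetical feature vectors of all parses in a tiny training corpus.
vectors = [(1, 2), (3, 1)]
sigmas = sigmas_from_corpus(vectors)
objective = penalized_log_pl(-10.0, [21.0, 0.0], sigmas)
```

Since the penalty grows without bound as a parameter grows, the regularized objective has a finite maximizer even when a feature is pseudo-maximal on the training corpus.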

4 A maximum correct estimator for
log linear models

The pseudo-likelihood estimator described in
the last section finds parameter values which
maximize the conditional probabilities of the
observed parses (syntactic analyses) given the
observed sentences (yields) in the training corpus.
One of the empirical evaluation measures
we use in the next section measures the number
of correct parses selected from the set of
all possible parses. This suggests another possible
objective function: choose $\theta$ to maximize
the number $C_\theta(\tilde\omega)$ of times the maximum likelihood
parse (under $\theta$) is in fact the correct parse,
in the training corpus.

$C_\theta(\tilde\omega)$ is a highly discontinuous function of $\theta$,
and most conventional optimization algorithms
perform poorly on it. We had the most success
with a slightly modified version of the simulated
annealing optimizer described in Press
et al. (1992). This procedure is much more computationally
intensive than the gradient-based
pseudo-likelihood procedure. Its computational
difficulty grows (and the quality of solutions
degrades) rapidly with the number of features.
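
To give a feel for this kind of search, here is a deliberately simple simulated annealing loop in Python. It is not the optimizer from Press et al.: the Gaussian proposal, the cooling schedule, the fractional credit for ties and the toy corpus are all assumptions made for the sketch.

```python
import math
import random

def correct_score(theta, corpus):
    """Count sentences whose best-scoring parse is the correct one (ties share the credit)."""
    total = 0.0
    for parses, gold in corpus:
        scores = [sum(t * f for t, f in zip(theta, fv)) for fv in parses]
        best = max(scores)
        tied = [i for i, s in enumerate(scores) if s == best]
        if gold in tied:
            total += 1.0 / len(tied)
    return total

def anneal(corpus, m, steps=500, temp=1.0, cooling=0.99, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * m
    score = correct_score(theta, corpus)
    best_theta, best_score = theta, score
    for _ in range(steps):
        cand = [t + rng.gauss(0.0, 0.5) for t in theta]    # random perturbation
        cand_score = correct_score(cand, corpus)
        # Always accept non-worsening moves; accept worse ones with falling probability.
        if cand_score >= score or rng.random() < math.exp((cand_score - score) / temp):
            theta, score = cand, cand_score
            if score > best_score:
                best_theta, best_score = theta, score
        temp *= cooling
    return best_theta, best_score

# Hypothetical corpus: (feature vectors of the parses, index of the correct parse).
corpus = [([(1, 0), (0, 1)], 0), ([(2, 1), (0, 1)], 0), ([(0, 1), (1, 0)], 1)]
best_theta, best_score = anneal(corpus, m=2)
```

The objective is piecewise constant in the parameters, so a gradient carries no information, which is why a derivative-free search of this kind is a natural fit.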

5 Empirical evaluation 

Ron Kaplan and Hadar Shemtov at Xerox PARC
provided us with two LFG parsed corpora. The
Verbmobil corpus contains appointment planning
dialogs, while the Homecentre corpus is
drawn from Xerox printer documentation. Table 1
summarizes the basic properties of these
corpora. These corpora contain packed
c/f-structure representations (Maxwell III and
Kaplan, 1995) of the grammatical parses of each
sentence with respect to Lexical-Functional
grammars. The corpora also indicate which of
these parses is in fact the correct parse (this
information was manually entered). Because
slightly different grammars were used for each
corpus we chose not to combine the two corpora,
although we used the set of features described in
section 2 for both in the experiments described
below. Table 2 describes the properties of the
features used for each corpus.

In addition to the two estimators described
above we also present results from a baseline estimator
in which all parses are treated as equally
likely (this corresponds to setting all the parameters
$\theta_j$ to zero).

We evaluated our estimators using a held-out
test corpus $\tilde\omega_{test}$. We used two evaluation
measures. In an actual parsing application a
SUBG might be used to identify the correct
parse from the set of grammatical parses, so
our first evaluation measure counts the number
$C_{\hat\theta}(\tilde\omega_{test})$ of sentences in the test corpus $\tilde\omega_{test}$
whose maximum likelihood parse under the estimated
model $\hat\theta$ is actually the correct parse.
If a sentence has $l$ most likely parses (i.e., all
$l$ parses have the same conditional probability)
and one of these parses is the correct parse, then
we score $1/l$ for this sentence.

The second evaluation measure is the pseudo-likelihood
itself, $PL_{\hat\theta}(\tilde\omega_{test})$. The pseudo-likelihood
of the test corpus is the likelihood of
the correct parses given their yields, so pseudo-likelihood
measures how much of the probability
mass the model puts onto the correct analyses.
This metric seems more relevant to applications
where the system needs to estimate
how likely it is that the correct analysis lies in
a certain set of possible parses; e.g., ambiguity-preserving
translation and human-assisted disambiguation.
To make the numbers more manageable,
we actually present the negative logarithm
of the pseudo-likelihood rather than the
pseudo-likelihood itself, so smaller is better.

                                          Verbmobil corpus   Homecentre corpus
Number of sentences                                    540                 980
Number of ambiguous sentences                          314                 481
Number of parses of ambiguous sentences               3245                3169

Table 1: Properties of the two corpora used to evaluate the estimators.

                                          Verbmobil corpus   Homecentre corpus
Number of features                                     191                 227
Number of rule features                                 59                  57
Number of pseudo-constant features                      19                  41
Number of pseudo-maximal features                       12                   4
Number of pseudo-minimal features                        8                   5

Table 2: Properties of the features used in the stochastic LFG models. The numbers of
pseudo-maximal and pseudo-minimal features do not include pseudo-constant features.

Because of the small size of our corpora we 
evaluated our estimators using a 10-way cross- 
validation paradigm. We randomly assigned 
sentences of each corpus into 10 approximately 
equal-sized subcorpora, each of which was used 
in turn as the test corpus. We evaluated on each 
subcorpus the parameters that were estimated 
from the 9 remaining subcorpora that served as 
the training corpus for this run. The evalua- 
tion scores from each subcorpus were summed 
in order to provide the scores presented here. 
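
The cross-validation protocol can be outlined as follows. In this Python sketch `train` and `evaluate` are placeholder callables standing in for an estimator and an evaluation measure, and the particular way of shuffling and splitting is an assumption; only the summing of per-fold scores is taken from the description above.

```python
import random

def cross_validated_total(corpus, train, evaluate, folds=10, seed=0):
    """Sum the evaluation scores obtained on each held-out subcorpus."""
    items = list(corpus)
    random.Random(seed).shuffle(items)                 # random assignment to subcorpora
    parts = [items[k::folds] for k in range(folds)]    # approximately equal sizes
    total = 0.0
    for k, test in enumerate(parts):
        training = [x for i, part in enumerate(parts) if i != k for x in part]
        model = train(training)                        # estimate on the other folds
        total += evaluate(model, test)                 # score on the held-out fold
    return total

# Toy check with stand-in callables: the "score" is just the size of the test fold.
train_sizes = []
def toy_train(training):
    train_sizes.append(len(training))
    return None

total = cross_validated_total(range(25), toy_train, lambda model, test: len(test))
```

Every sentence is held out exactly once, so the summed scores cover the whole corpus while no sentence is ever scored by a model that saw it during training.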

Table 3 presents the results of the empirical
evaluation. The superior performance of
both estimators on the Verbmobil corpus probably
reflects the fact that the non-rule features
were designed to match both the grammar
and content of that corpus. The pseudo-likelihood
estimator performed better than the
correct-parses estimator on both corpora under
both evaluation metrics. There seems to
be substantial overfitting in all these models;
we routinely improved performance by discarding
features. With a small number of
features the correct-parses estimator typically
scores better than the pseudo-likelihood estimator
on the correct-parses evaluation metric, but
the pseudo-likelihood estimator always scores
better on the pseudo-likelihood evaluation metric.

6 Conclusion 

This paper described a log-linear model for 
SUBGs and evaluated two estimators for such 
models. Because estimators that can estimate 
rule features for SUBGs can also estimate other 
kinds of features, there is no particular reason to 
limit attention to rule features in a SUBG. Indeed,
the number and choice of features strongly
influence the performance of the model. The
estimated models are able to identify the cor- 
rect parse from the set of all possible parses ap- 
proximately 50% of the time. 

We would have liked to introduce features 
corresponding to dependencies between lexical 
items. Log-linear models are well-suited for lex- 
ical dependencies, but because of the large num- 
ber of such dependencies substantially larger 
corpora will probably be needed to estimate 
such models.¹

¹ Alternatively, it may be possible to use a simpler
non-SUBG model of lexical dependencies estimated from
a much larger corpus as the reference distribution with



                               Verbmobil corpus        Homecentre corpus
                               C         -log PL       C         -log PL
Baseline estimator             9.7%      533           15.2%     655
Pseudo-likelihood estimator    58.7%     396           58.8%     583
Correct-parses estimator       53.7%     469           53.2%     604

Table 3: An empirical evaluation of the estimators. $C(\tilde\omega_{test})$ is the number of maximum likelihood
parses of the test corpus that were the correct parses (reported as a percentage), and
$-\log PL(\tilde\omega_{test})$ is the negative logarithm of the pseudo-likelihood of the test corpus.



However, there may be applications which 
can benefit from a model that performs even at 
this level. For example, in a machine-assisted 
translation system a model like ours could 
be used to order possible translations so that 
more likely alternatives are presented before less 
likely ones. In the ambiguity-preserving trans- 
lation framework, a model like this one could be 
used to choose between sets of analyses whose 
ambiguities cannot be preserved in translation. 